SUPPORTING WEB-QUERY EXPANSION EFFICIENTLY USING
MULTI-GRANULARITY INDEXING AND QUERY PROCESSING

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates generally to the field of indices and
queries applied to a collection of documents in a database. 
More specifically, this invention relates to efficient expan- 
sion and processing of the queries, to reducing the size of 
indices used to perform the query expansion and to progres- 
sive query processing. 

2. Description of the Related Art 

Conventional retrieval systems, by which documents may 
be retrieved through the application of queries, are based on 
a common set of principles and methodologies of catego- 
rizing documents. Documents are normally indexed manu- 
ally by subject experts or librarians using pre-specified and 
controlled vocabularies. Alternatively, documents can be 
indexed based on the words included in the documents. 
Users can search documents using terms from the accepted 
vocabularies, together with appropriate boolean operators 
between them. In this type of system, an exact match 
strategy is used. Although this approach has many 
advantages, such as simplicity and high precision, it suffers 
from the problem of word mismatch. 

The problem of word mismatch in information retrieval 
occurs because people often use different words to describe 
concepts in their queries than authors use to describe the 
same concepts in their documents. FIG. 1 shows that words 
used in HyperText Markup Language (HTML) documents 
related to the words "car" and "dealer" may vary from one 
document to another. Languages other than HTML, such as 
Extensible Markup Language (XML) and Standard Gener- 
alized Markup Language (SGML), may be used. If a user 
uses a query with the words "automobile" and "dealer," he 
or she cannot retrieve all the relevant documents due to word
mismatch problems. 

Query expansion has been suggested as a technique for 
dealing with this problem. Such an approach would expand 
queries using semantically similar words (e.g. synonyms or 
other semantically related words) and syntactically related 
words (e.g. words co-occurring in the same document above 
a certain frequency are syntactically co-occurring words) to 
those words in the query to increase the chances of matching 
words in relevant documents. When query expansion is 
used, the "car dealer" query is expanded as follows to 
include terms with similar meanings: 

Line 1. [("car" OR "automobile" OR "auto" OR "sedan") OR
Line 2. ("Ford" OR "Buick")] AND
Line 3. ("Dealer" OR "Showroom" OR "SalesOffice")
There are two types of query expansion involved in this 
example. The query expansions in Line 1 and Line 3 are 
adding additional words related to car and dealer by lexical 
semantics, i.e. words which are semantically similar. 
Automobile, auto, and sedan are words having a similar 
meaning to the word car. Similarly, Showroom and Sale- 
sOffice have meanings similar to the word dealer. The other 
type of query expansion, shown in Line 2, is by, for example, 
syntactical co-occurrence relationships. A large number of 
the words used on the World Wide Web ("the Web") are 
actually proper names, which cannot be found in lexical 
dictionaries. Examples of proper names include Ford, Buick,
NBA, and National Football League. As noted above, syntactical
co-occurrence relationships are derived from analysis of the
frequency of two words co-occurring in the same document. This is
based on the assumption that there is a higher chance that two
words are related if they appear
frequently together in the same document. For example, the 
co-occurring words with Ford could be dealer, body shop, 
Mustang, Escort, etc. 
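
The co-occurrence analysis described above can be sketched as follows. This is only a minimal illustration, assuming the documents have already been split into word lists; the function name `cooccurring_pairs`, the `min_docs` threshold and the sample documents are hypothetical and are not taken from any cited system.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(documents, min_docs=2):
    """Return word pairs that appear together in at least `min_docs` documents.

    `documents` is a list of word lists. `min_docs` is an assumed,
    illustrative threshold standing in for "above a certain frequency".
    """
    pair_counts = Counter()
    for words in documents:
        # Count each unordered pair at most once per document.
        unique_words = sorted({w.lower() for w in words})
        pair_counts.update(combinations(unique_words, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_docs}


# Hypothetical sample collection.
docs = [
    ["Ford", "dealer", "Mustang"],
    ["Ford", "dealer", "Escort"],
    ["Buick", "showroom"],
]
related = cooccurring_pairs(docs, min_docs=2)
```

With this sample, only the pair ("dealer", "ford") co-occurs in two documents, so it is the only pair kept as syntactically related.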

To support query expansion, indices of words related by lexical
semantics and syntactical relationships, such as co-occurrence,
need to be maintained. The indices for related words by lexical
semantics can be constructed as a hierarchical structure (see e.g.
W. Li et al., "Facilitating Multimedia Database Exploration through
Visual Interfaces and Perpetual Query Reformulations," Proceedings
of the 23rd International Conference on Very Large Data Bases,
pages 538-547, Athens, Greece, August 1997), a semantics network
(see e.g. G. A. Miller, "Nouns in WordNet: A Lexical Inheritance
System" In International Journal of Lexicography 3 (4), 1990,
pages 245-264), or hierarchical clusters of associated words (see
e.g. G. Salton et al., "The SMART and SIRE Experimental Retrieval
Systems," pages 118-155, McGraw-Hill, New York, 1983). Since
syntactical relationships, such as syntactical co-occurrence
relationships, are binary, the size of syntactical relationship
indices can be extremely large. Some techniques have been proposed
for stemming. See e.g., G. Grefenstette, "Use of syntactic context
to produce term association lists for text retrieval," Proceedings
of the Fifteenth Annual International ACM SIGIR Conference,
Denmark, 1992; J. Xu et al., "Query Expansion Using Local and
Global Document Analysis," Proceedings of the 19th Annual
International ACM SIGIR Conference, Zurich, Switzerland, 1996; and
C. Jacquemin, "Guessing Morphology from Terms and Corpora,"
Proceedings of the 20th Annual International ACM SIGIR Conference,
Philadelphia, Pa., USA, 1997. Such techniques include analysis of
occurrence frequency, and employing morphological rules (e.g.
converting all words to root form) or lexical dictionaries.
However, the size of indices for words associated by syntactical
co-occurrence relationships is too large to search efficiently.

A substantial amount of work on the problem of word mismatch has
been done in the area of information retrieval (IR). See e.g. G.
Salton et al., "Introduction to Modern Information Retrieval,"
McGraw-Hill Book Company, 1983; G. Salton, "Automatic Text
Processing: The Transformation, Analysis, and Retrieval of
Information by Computer," Addison-Wesley Publishing Company, Inc.,
1989; and K. Sparck Jones et al., "Readings in Information
Retrieval," Morgan Kaufmann, San Francisco, Calif., USA, 1997.
However, much of the work has been directed to the study of
retrieval measures such as recall and precision. Although some work
has suggested ways to efficiently support query expansion (see e.g.
C. Buckley et al., "Automatic Query Expansion Using SMART,"
Proceedings of the 3rd Text Retrieval Conference, Gaithersburg,
Md., 1993) and indexing mechanisms, two problems have persisted
without an acceptable solution. First, index size is extremely
large since many words in a document collection (e.g. the Web) are
distinct proper names and each word has a number of semantically
similar and syntactically related words. Second, query processing
is expensive since queries are expanded by adding additional words.

These problems get worse when dealing with document information
collected from the Web since the number of documents is very large
and the words used are extremely diverse, inconsistent, and
sometimes incorrect (e.g., typographical errors). A study has shown
that most user queries on the Web typically involve two words. See
B. Croft et al., "Providing Government Information on the Internet:
Experiences with THOMAS," Proceedings of Digital Libraries
(DL '95), 1995. However, with query expansion, query lengths
increase substantially. As a result, most existing search engines
on the Web do not provide query expansion functionality.

An overview of existing work in the area of query expansion will
now be presented. Query expansion has received significant
attention in the field of IR. However, the focus in the past has
been to evaluate the improvements in retrieval measures, i.e.,
precision and recall, as a result of query expansion. Another
research focus has been in the direction of building dictionaries
so as to identify a set of similar terms for a given query word.
However, the existing work has done little to address the problem
of efficient processing of queries when they undergo query
expansion or to reduce the size of the indices used to perform
query expansion and processing. Furthermore, the issue of ranking
documents on the basis of exact and similarity matches remains a
difficult problem.

SMART is one of the best known advanced information retrieval
systems. See R. T. Dattola, "Experiments with a fast algorithm for
automatic classification," Gerard Salton, editor, The SMART
Retrieval System - Experiments in Automatic Document Processing,
chapter 12, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1971; and
G. Salton et al., "The SMART and SIRE Experimental Retrieval
Systems," supra. In SMART, each document is represented by a vector
of terms. Each position of the vector represents the weight (i.e.
importance) of corresponding terms in the document. For a document
collection of M documents with N distinct terms, the collection is
represented as an M x N matrix. A query is also represented as a
vector of terms. The document retrieval is based on similarity
computation of the cosine measure of the query vector and each
document vector. Other well known systems include INQUERY. See J.
Callan et al., "Trec and tipster experiments with inquery,"
Information Processing and Management, 31:327-332, 1995.

Latent Semantic Indexing (LSI) is a technique which relies on
statistically derived conceptual indices instead of individual term
retrieval in lexical matching. See R. Harshman et al., "Indexing by
latent semantic analysis," Journal of the American Society of
Information Science, 41:391-407, 1990; and M. W. Berry et al.,
"Computational Methods for Intelligent Information Access,"
Proceedings of the 1995 ACM Conference on Supercomputing, 1995. LSI
assumes that there is some hidden or latent structure in word
usage, which needs to be externalized by analyzing the word
occurrence in a document. Hence, documents are viewed as vectors in
a very high dimensional term space and the individual elements in
the vector represent the frequency of occurrence of a particular
term in a given document. More sophisticated measures based on both
global and local weightings can also be used. A truncated singular
value decomposition (SVD) is used to estimate the structure in word
usage across documents. See G. Golub et al., "Matrix Computations,"
Johns-Hopkins, Baltimore, Second Edition, 1989. Retrieval is then
performed using the database of singular values and vectors
obtained from the truncated SVD. Preliminary evaluation of LSI
indicates that this approach of information retrieval is a more
robust measure than that based on individual terms.

Automated query expansion has long been suggested as a technique
for dealing with the word mismatch issue. See e.g., E. Voorhees,
"Query Expansion Using Lexical-Semantic Relations," Proceedings of
the 17th Annual International ACM SIGIR Conference, Dublin,
Ireland, 1994. One approach uses a thesaurus to expand the query to
increase the chances of matching words in relevant documents. A
study has shown that simply using a general thesaurus provides
limited improvement. Id. Many advanced techniques have also been
proposed. See e.g., O. Kwon et al., "Query Expansion Using Domain
Adapted, Weighted Thesaurus in an Extended Boolean Model,"
Proceedings of the 3rd International Conference on Information and
Knowledge Management, 1994; Y. Qui et al., "Concept Based Query
Expansion," Proceedings of the 16th Annual International ACM SIGIR
Conference, Pittsburgh, Pa., USA, 1993; E. Voorhees, "Query
Expansion Using Lexical-Semantic Relations," supra; and M. W. Berry
et al., "Computational Methods for Intelligent Information Access,"
supra. Based on the experimental results, automatic query
expansion, on average, improves effectiveness of retrieval by 7% to
25%. See C. Buckley et al., "Automatic Query Expansion Using
SMART," supra.

Alternatively, improvements can be made by including syntactically
relevant words. This approach is to cluster words based on
co-occurrence in documents and to use these clusters to expand
queries. Since the co-occurrence is a binary relationship, the size
of such index is usually extremely large. One group has proposed a
technique for using corpus-based word variant co-occurrence
statistics to modify or create a stemmer and has demonstrated its
advantage over the approach of using only morphological rules. See
W. B. Croft et al., "Corpus-Specific Stemming Using Word Form
Co-occurrence," Proceedings of the Fourth Annual Symposium, 1994.
The above techniques that expand a query term to a set of
semantically related terms are called global analysis. In query
expansion, terms from relevance feedback can also be added to the
subsequent query to improve the effectiveness of retrieval. See G.
Salton et al., "Improving retrieval performance by relevance
feedback," Journal of the American Society for Information Science,
41(4):288-297, June 1990. This is called local analysis. A formal
study has shown that using global analysis techniques, such as word
context and phrase structure, on the local set of documents
produces results that are both more effective and more predictable
than simple local feedback. See J. Xu et al., "Query Expansion
Using Local and Global Document Analysis," supra. Each of the
references discussed herein is hereby incorporated by reference.

However, as noted above, the past work has failed to address the
problem of efficient processing of queries when they undergo query
expansion or of reducing the size of the indices used to perform
query expansion and processing.

SUMMARY OF THE INVENTION

The present invention provides a solution to the problem of word
mismatch and resulting inefficient query processing via a method
and apparatus for efficient query expansion using reduced size
indices and for progressive query processing. More specifically,
queries are expanded conceptually, rather than physically, using
semantically similar and syntactically related words to those
specified by the user in the query to reduce the chances of missing
relevant documents. To support query expansion, indices on words
related by lexical semantics and syntactical co-occurrence need to
be maintained. Two issues become paramount in supporting such query
expansion: the size of index tables and the query processing
overhead. In accordance with the present invention, the notion of a
multi-granularity information and processing structure is used to
support efficient query expansion, which involves an indexing
phase, a query processing phase and a ranking phase. In the
indexing phase, semantically similar words are grouped into 
a concept which results in a substantial index size reduction 
due to the coarser granularity of semantic concepts. During 
query processing, the words in a query are mapped into their 
corresponding semantic concepts and syntactic extensions, 
using a dictionary and actual data contents, resulting in a 
logical expansion of the original query. Additionally, the 
processing overhead can be avoided. The initial query words 
can then be used to rank the documents in the answer set on 
the basis of exact, semantic and syntactic matches and can 
also be used to perform progressive query processing. 
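
The three phases summarized above can be illustrated with a small sketch. It is not the claimed implementation: the toy dictionary `WORD_TO_CONCEPT`, the helper names and the sample documents are assumptions made for illustration, and ranking is reduced to counting exact matches of the original query words.

```python
from collections import defaultdict

# Assumed toy dictionary: word -> semantic concept (names are illustrative).
WORD_TO_CONCEPT = {
    "car": "Sem1", "automobile": "Sem1", "auto": "Sem1", "sedan": "Sem1",
    "dealer": "Sem2", "showroom": "Sem2", "salesoffice": "Sem2",
}


def to_concept(word):
    """Map a word to its concept; words without one map to themselves."""
    word = word.lower()
    return WORD_TO_CONCEPT.get(word, word)


def build_concept_index(doc_words):
    """Indexing phase: one row per concept instead of one row per word."""
    index = defaultdict(set)
    for doc_id, words in doc_words.items():
        for word in words:
            index[to_concept(word)].add(doc_id)
    return index


def process_query(index, query_words):
    """Query processing: one lookup per query word, at concept granularity."""
    doc_sets = [index.get(to_concept(w), set()) for w in query_words]
    return set.intersection(*doc_sets) if doc_sets else set()


def rank(doc_words, answer, query_words):
    """Ranking phase: documents with more exact query-word matches first."""
    query = {w.lower() for w in query_words}

    def exact_matches(doc_id):
        return len(query & {w.lower() for w in doc_words[doc_id]})

    return sorted(answer, key=lambda d: (-exact_matches(d), d))


# Hypothetical sample collection.
docs = {
    "d1": ["car", "dealer"],
    "d2": ["automobile", "showroom"],
    "d3": ["car", "garage"],
}
index = build_concept_index(docs)
answer = process_query(index, ["car", "dealer"])
ranked = rank(docs, answer, ["car", "dealer"])
```

In this sketch the query "car dealer" needs only two lookups, yet it retrieves both d1 and d2; the original words are then used to place the exact match d1 ahead of the semantic match d2.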

BRIEF DESCRIPTION OF THE DRAWINGS 

The above objects and advantages of the present invention 
will become more apparent by describing in detail preferred 
embodiments thereof with reference to the attached draw- 
ings in which: 

FIG. 1 illustrates the problem of word mismatch in the
context of information retrieval. 

FIG. 2 illustrates an example of indices which are tradi- 
tionally used in exact match information retrieval systems. 

FIG. 3 illustrates an example of indices derived by 
grouping words into semantically similar concepts and syn- 
tactically related extensions for use in traditional informa- 
tion retrieval systems. 

FIG. 4 illustrates the index structures required for more 
efficient query processing in accordance with the present 
invention. 

FIG. 5 illustrates the merging of co-occurring word index 
entries. 

FIG. 6 illustrates query expansion processing in a tradi- 
tional information retrieval system. 

FIG. 7 illustrates query expansion processing using the 
multi-granularity query expansion scheme of the present 
invention. 

FIG. 8 illustrates the ranking process in accordance with 
the present invention. 

FIG. 9 illustrates a two dimensional ranking graph for a 
query having two words. 

FIG. 10 illustrates a sequence for progressive query 
processing. 

FIG. 11 illustrates a sequence for progressive query 
processing where a keyword may be assigned a level of 
importance.

FIG. 12 illustrates a physical embodiment which may be 
used to implement the present invention. 

DETAILED DESCRIPTION OF A PREFERRED 
EMBODIMENT 

A preferred embodiment of a method and apparatus for 
efficient query expansion is described below in detail with 
reference to the accompanying drawings. It is to be noted 
that while the following discussion is presented in the 
context of the NEC PERCIO Object Oriented Database 
Management System (OODBMS), the present invention is 
not so limited. The present invention may be applied to a 
variety of database systems and document collections. 

The present invention provides efficient indexing and 
processing support for query expansion by applying a con- 
cept of multi-granularity. The present approach takes the 
indices for semantically similar and syntactically related 
words after word stemming using available techniques (see e.g.,
J. Xu et al., "Query Expansion Using Local and Global Document
Analysis," Proceedings of the 19th Annual International ACM SIGIR
Conference, Zurich, Switzerland, 1996; and C. Jacquemin, "Guessing
Morphology from Terms and Corpora," Proceedings of the 20th Annual
International ACM SIGIR Conference, Philadelphia, Pa., USA, 1997),
and further reduces the index size by merging some entries (tuples)
to an entry at a higher level of granularity. During query
processing the tuples with information at a higher level of
granularity are used to retrieve relevant documents. The original
words of the query, at a finer granularity, can then be used to
rank the documents in the answer set retrieved during query
processing on the basis of exact, semantic, and syntactic
similarity matches. By using multi-granularity indexing and query
processing techniques, the advantages of smaller index size and
faster query processing time are gained while preserving the
overall precision of the retrieval mechanism.
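
As a rough illustration of merging index entries (tuples) into entries at a higher level of granularity, the sketch below rewrites word pairs as concept pairs. The dictionary `WORD_TO_CONCEPT`, the function `merge_entries` and the sample entries are hypothetical; the sketch also assumes that the document lists of merged entries are combined by set union and that words missing from the dictionary are left unchanged.

```python
from collections import defaultdict

# Assumed toy dictionary; words missing from it are left as they are.
WORD_TO_CONCEPT = {
    "car": "Sem1", "auto": "Sem1", "automobile": "Sem1",
    "dealer": "Sem2", "showroom": "Sem2",
}


def merge_entries(cooccurrence_index):
    """Merge (word, word) tuples into coarser (concept, concept) tuples.

    `cooccurrence_index` maps a word pair to the set of documents in which
    the pair co-occurs. Document sets of merged tuples are combined by
    union (an assumption of this sketch).
    """
    merged = defaultdict(set)
    for (left, right), doc_ids in cooccurrence_index.items():
        key = (WORD_TO_CONCEPT.get(left, left),
               WORD_TO_CONCEPT.get(right, right))
        merged[key] |= doc_ids
    return dict(merged)


# Hypothetical fine-granularity entries.
fine = {
    ("Ford", "car"): {"d1"},
    ("Ford", "auto"): {"d2"},
    ("Ford", "dealer"): {"d1", "d3"},
    ("Ford", "showroom"): {"d4"},
    ("Toyota", "Avalon"): {"d5"},
}
coarse = merge_entries(fine)
```

Here five fine-grained entries shrink to three: the two dictionary-word variants collapse into ("Ford", "Sem1") and ("Ford", "Sem2"), while the pair of proper names is kept as is.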

Initially, the notion of multi-granularity will be discussed, as
well as how it can be adapted in conjunction with the traditional
indexing employed by most IR systems. Then, the storage overhead of
multi-granularity indexing for a given document collection will be
evaluated.

Traditional IR systems maintain indices to facilitate the retrieval
of a list of documents for a given word as well as to extract a set
of words associated with a given document. It is to be noted that
in the present context, the term document may refer to text, images
or a combination of text and images. The indices are illustrated in
FIG. 2. Note that the table in FIG. 2(b) is an inverted index of
the table in FIG. 2(a). In FIG. 2, the indices are shown as tables
for ease of explanation. In actual implementation, classes on top
of the NEC PERCIO OODBMS may be used, for example. Taking a sample
query, if a user initiates a query with the words "car" and
"dealer," the IR system fetches the list of documents from the
corresponding rows in FIG. 2(b). The intersection of the document
lists from the two rows forms the answer to the query. Clearly,
this approach to IR supports only exact matches and will fail to
retrieve relevant documents containing terms with similar meanings
such as "automobile dealer," "car showroom" or "automobile
showroom." Query expansion can be used in conjunction with a
special utility to expand the query from "car" and "dealer" to
("car" or "automobile") and ("dealer" or "showroom"). Although this
approach is feasible, it results in introducing significant query
processing overhead. In particular, instead of two lookups of the
index table in FIG. 2(b), several lookups are needed for each word
semantically similar to that in the original query. Also, a
thesaurus-like facility such as an on-line dictionary is needed to
expand the query terms to their semantically similar counterparts.
These observations have led to the development, in accordance with
the present invention, of a more efficient scheme to support query
expansion in querying document collections.
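
The lookup overhead described above can be made concrete with a small sketch. The table contents, the `synonyms` mapping and both function names are illustrative assumptions; the inverted index stands in for a table in the style of FIG. 2(b).

```python
def exact_match(inverted_index, query_words):
    """Intersect the document lists of the query words (exact match only)."""
    doc_sets = [inverted_index.get(w, set()) for w in query_words]
    return set.intersection(*doc_sets) if doc_sets else set()


def physically_expanded_match(inverted_index, synonyms, query_words):
    """Expand each query word into an OR-group, then AND the groups.

    Returns the answer set and the number of index lookups performed.
    """
    lookups = 0
    answer = None
    for word in query_words:
        group_docs = set()
        for term in [word] + synonyms.get(word, []):
            group_docs |= inverted_index.get(term, set())
            lookups += 1
        answer = group_docs if answer is None else answer & group_docs
    return (answer or set()), lookups


# Hypothetical word -> document-list table.
inverted_index = {
    "car": {"d1"},
    "dealer": {"d1"},
    "automobile": {"d2", "d3"},
    "showroom": {"d2"},
}
synonyms = {"car": ["automobile"], "dealer": ["showroom"]}

exact = exact_match(inverted_index, ["car", "dealer"])
expanded, lookups = physically_expanded_match(
    inverted_index, synonyms, ["car", "dealer"]
)
```

With one synonym per query word, the physically expanded query already doubles the number of index lookups from two to four, and the count grows with every additional similar word.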

As discussed earlier, in order to avoid mismatch between user and
author vocabularies, query expansion based on expanding query words
on semantic similarity and syntactic relationships is needed.
FIG. 3 illustrates the additional data-structures that are needed
to facilitate query expansion in traditional IR systems. In
particular, FIG. 3(a) is a table derived from a lexical-semantic
on-line dictionary where words are grouped into semantically
similar concepts. The table shown in FIG. 3(a) is simplified for
ease of exposition. For example, the set of similar terms "car",
"auto", "automobile" and "sedan" is represented as a symbolic
entity Sem1. Unlike for semantic similarity, which is based on a
dictionary or a thesaurus, syntactic relationships in IR are
determined from the document collection itself. In particular, word
co-occurrence information can be used to relate two words
syntactically. FIG. 3(b) illustrates the index that captures this
information. By using the traditional IR indices of FIG. 2 in
conjunction with the auxiliary indices of FIG. 3, a rudimentary
query expansion scheme can be supported in an IR system. Basically,
given a user query, the query word list is expanded to include
words that are both semantically similar as well as syntactically
related.

Although the set-up discussed above can be used for processing
queries with query expansion, this approach results in a high
processing overhead. In accordance with the present invention,
additional index structures that would allow the queries to be
processed more efficiently are employed. The basic idea of the
present approach is to transform the indices in FIGS. 2 and 3 so
that queries can be expanded conceptually. That is, instead of
expanding the query word list physically by including in the list
semantically similar and syntactically related words, the query is
expanded conceptually by replacing the query words by their
corresponding higher level semantic concept and syntactic
relationship (e.g. co-occurrence). This results in additional
storage overhead due to the additional index structures. However,
savings are realized since user queries can be processed more
efficiently.

In order to process the expanded queries as described above, the
index tables are modified as shown in FIG. 4. In particular, the
index table in FIG. 4(a) is derived from FIG. 2(a) by replacing
each word (which is not a proper noun) by its higher level semantic
concept. The index table FIG. 4(b) is obtained by combining the
words in FIG. 2(b) into their corresponding higher level semantic
concept and merging the respective document list entries. Thus the
row entries corresponding to "car", "auto", "automobile" and
"sedan" appear as a single entry Sem1 in FIG. 4(b). Similarly the
rows corresponding to "dealer", "showroom", and "SalesOffice" in
FIG. 2(b) collapse into a single row labeled Sem2.

The index for syntactically related words is usually much larger
than the index for semantically related words for several reasons.
A large number of words on the Web are proper names and are not
found in a dictionary. According to a study by the present
inventors on parsing 2,904 documents, only 42% of keywords can be
found in WordNet, which has more than 60,000 words. See G. A.
Miller, "Nouns in WordNet: A Lexical Inheritance System" In
International Journal of Lexicography 3 (4), 1990, pages 245-264.
The other 58% of words include proper names and typographical
errors, which have contributed to the large size of the indices. In
the traditional IR systems, the syntactic association is generally
captured through co-occurrence relationship. Since word
co-occurrences in the same document are one to one relationships,
if there are n words identified, the worst case for the index size
is

n(n - 1)/2

associations. It is too expensive to index co-occurring words of
more than two due to enormous storage and indexing overhead.

Let the words found in a dictionary be denoted as S (semantically
meaningful) and all other words be denoted as P (proper names).
Based on the above classification of dictionary and non-dictionary
words, the co-occurrence relationship between words can be
classified into three different categories:

P-P form such as (Toyota, Avalon), (Acura, Legend), (Nissan,
Maxima).

S-P or P-S forms such as (Buick, car), (Buick, dealer), (car,
Ford), (Ford, auto), and (Ford, dealer).

S-S form such as (car, garage) and (auto, garage).

Generally, it will be difficult to convert the entries in FIG. 3(b)
of the form P-P to a coarser granularity. All other entries,
however, have an S word which can be replaced by its corresponding
higher level semantic concept. This will result in a reduction in
the size of co-occurrence index and will speed up query processing.

The reduction in index size occurs as follows. For each entry in
S-P form, (w_i, X), such that w_i corresponds to a semantic concept
Sem_i, replace all such (w_i, X) entries of FIG. 3(b) by (Sem_i, X)
in FIG. 4(c), with the corresponding doc_list entries merged. A
similar procedure is applied to entries in P-S form. For example,
entries such as (Ford, auto) are replaced by (Ford, Sem1).
Similarly, entries (Ford, dealer) and (Ford, showroom) are replaced
by (Ford, Sem2). This merge mechanism is illustrated in FIGS. 5(a)
and 5(b).

Entries in S-S form can be merged in two ways:

Simple merge: 1-to-many/many-to-1 types of merge as shown in
FIGS. 5(a) and (b). For example, entries (car, dealer),
(automobile, dealer), and (auto, dealer) are replaced by (Sem1,
dealer). The algorithm used here is the same as that for S-P and
P-S form.

Complex merge: Many to many types of merge as shown in FIG. 5(c).
An example is to represent entries (car, dealer), (automobile,
showroom), and (auto, SalesOffice) as (Sem1, Sem2). The algorithm
for this type of merge is as follows:

1. For each entry in S-S form (w_i, w_j) such that w_i corresponds
to a semantic concept Sem_i, replace all such (w_i, w_j) entries of
FIG. 3(b) by (Sem_i, w_j).

2. For each entry of type (Sem_i, w_j) such that w_j corresponds to
a semantic concept Sem_j, replace all such (Sem_i, w_j) by (Sem_i,
Sem_j).

Note that step 2 may be performed before step 1. Additionally,
steps 1 and 2 of the algorithm may be iteratively performed until
no further merges are possible.

When multiple entries are merged, the respective syn_doc_list for
each of the entries is also merged accordingly by a union
operation.

The multi-granularity indexing scheme may be implemented on top of
an OODBMS. In such an implementation, the tables in FIGS. 3(a) and
4(c) are classes with contents. The other tables are classes with
only pointers. Update and insert operations on indices may be
carried out by the OODBMS through automatic view maintenance, which
propagates the changes among classes. The maintenance of
multi-granularity indices is done incrementally; reorganization is
not required.

Next, an example estimate will be calculated regarding the
additional storage overhead that is needed to support index-tables
based on semantic concepts, in accordance with the present
invention, in addition to the traditional word-based indices. As
discussed in the previous section, the tables of FIG. 4 have been
introduced for efficient query processing. Initially, the storage
estimate for the indices used in a traditional IR system, i.e., the
tables shown in FIG. 2, will be calculated. Assume that the number
of documents in the given collection is D. Furthermore, assume that
the number of distinct dictionary words (after removing stop-words
and grouping words by using word-stemming) in the given document
collection is W and the number of non-
dictionary words be V. Let the average number of dictionary words
per document be w, non-dictionary words be v and the average number
of documents per word be d. The index sizes in terms of number of
entries (i.e., number of rows) as well as in terms of the total
size (i.e., number of pointers) of the table are computed. Note
that each entity in the table is represented as a pointer
data-type. Given these parameters, the size of table in FIG. 2(a)
is:

Number Of Rows[2(a)] = D                                        (1)

TotalSize[2(a)] = D (1 + v + w)                                 (2)

Note that the term 1+v+w arises because in each row one pointer is
needed for the document identifier, on the average v pointers are
needed to represent non-dictionary words in the word list and, on
average w pointers are needed for dictionary words in the word
list. Similarly, the size of the table in FIG. 2(b) is given as:

Number Of Rows[2(b)] = W + V                                    (3)

TotalSize[2(b)] = (1 + d) (W + V)                               (4)

Each row in this table needs, on average, d pointers for the
document identifiers in the document list and one pointer for the
word itself.

Next, the storage overhead of the on-line dictionary and the
syntactic co-occurrence table that are needed to support basic
query expansion is estimated. Let f be the compression factor that
is obtained by grouping dictionary words into semantic concepts.
Thus, f can be viewed as the average number of words grouped by a
concept. The table size in FIG. 3(a) is:

Number Of Rows[3(a)] = W/f                                      (5)

TotalSize[3(a)] = W + W/f                                       (6)

Equation (5) arises since the dictionary word space shrinks by the
compression factor f, whereas in Equation (6), W pointers are
needed to represent the words in the word list and W/f pointers are
needed to represent the semantic identifiers. The term q represents
the average number of entries in the document list per
co-occurrence term. The worst case size in the table in FIG. 3(b)
is given by the following expressions:

Number Of Rows[3(b)] = V(V - 1)/2 + VW + W(W - 1)/2             (7)

TotalSize[3(b)] = (1 + 2 + q) (V(V - 1)/2 + VW + W(W - 1)/2)    (8)

In Equation (7) the first term corresponds to the word
co-occurrences of the form P-P, the second to S-P or P-S, and the
last corresponds to the co-occurrences of the form S-S. The term q
represents the average number of entries in the document list per
co-occurrence term. In addition

larger than f d. On the other hand, the average number of concepts
per document cannot become larger than w. In fact, it can be argued
that this number will be comparable to w. Based on these
parameters, the additional storage overhead of multi-granularity
indexing can be computed. The computation for the table in
FIG. 4(a) is:

Number Of Rows[4(a)] = D                                        (9)

TotalSize[4(a)] = D (1 + v + w)                                 (10)

That is, the size remains the same as that in FIG. 2(a). On the
other hand the size of the table in FIG. 4(b) is:

Number Of Rows[4(b)] = W/f                                      (11)

TotalSize[4(b)] = (1 + f d) W/f                                 (12)

The number of dictionary word entries reduces by a factor of f due
to the words being grouped into semantic concepts. However, the
number of documents per semantic concept increases by approximately
the same factor. As a result the size of the table remains
comparable to the table in FIG. 2(b). Note that the tables in
FIGS. 4(a) and (b) are the two tables of FIGS. 2(a) and (b),
respectively, at a higher level of granularity. Finally, the
estimated storage of the table of FIG. 4(c) is computed:

Number Of Rows[4(c)] = V(V - 1)/2 + V W/f + W(W - 1)/(2 f^2)    (13)

TotalSize[4(c)] = (1 + 2 + q) V(V - 1)/2 + (1 + 2 + q f) V W/f
                  + (1 + 2 + q f) W(W - 1)/(2 f^2)              (14)

Basically, all the co-occurrence terms involving the S form are
compressed by a factor of f and result in substantial savings in
this table when compared to that in FIG. 3(b).

Finally, in accordance with the present invention, all of the
tables except FIG. 3(b) are required. On the other hand, the
rudimentary query expansion scheme will need tables in FIGS. 2 and
3. Thus, although the storage cost in the present scheme increases
due to the tables 4(a) and 4(b), it is partially compensated by the
reduced size of table 4(c). The exact savings will depend on the
various values of the parameters discussed above. In the worst case
the extra storage will be significantly less than twice that in the
rudimentary query expansion scheme.

The indexing scheme presented above has been discussed with the
assumption that a word has only a single sense. However, words
generally have multiple senses. For example, the word bank can be
interpreted as a financial institution or as a riverbank. To
consider words with multiple senses, a word in the Sem_word_list
(shown in FIG. 3) may belong to multiple Concept# in FIG. 4(a). For
example, bank may be associated with Sem10 and Sem20. To perform
query expansion with consideration of multiple senses, when a query
contains a word which is located in

three pointers are needed to represent the syntactic term 55 multiple Concept#, each of the different Concept# should be 

identifier and the two words involved in a co-occurrence taken into account during query processing, 

relationship in each row. In the above discussion, the indexing scheme has been 

Next, the storage overhead of multi-granularity indexing implemented on top of NEC OODBMS, and the words in 

in accordance with the present invention, which groups Sem_word_Jist are associated with Concept# through 

semantically similar set of terms into a unique semantic 60 pointers. No redundant data is stored and storage costs for 

concept, is estimated. As before, in order to compute the pointers are very low. Since WordNet can provide synonyms 

sizes of index tables in FIG. 4 the average number of for a word in different sense interpretations ranked by how 

documents per semantic concept and the average number of often such senses are used (e.g. bank is interpreted as a 

semantic concepts per document need to be estimated. Since financial institution more often than being interpreted as a 

multiple terms reduce to a semantic concept, the average 65 river bank), the most popular sense interpretation is used in 

number of documents per semantic concept will be larger the current implementation. However, the data structure is 

than d and it can be argued that this expansion will be no extendible. 
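The handling of words with multiple senses can be illustrated with a
minimal sketch. It is not the NEC OODBMS implementation; the mapping
SEM_WORD_LIST, the function concepts_for and the entries for car and
dealer are hypothetical names chosen for illustration, and only the
association of bank with Sem10 and Sem20 comes from the description
above. Each word maps to its Concept# entries ordered from the most to
the least frequently used sense.

```python
from typing import Dict, List

# Hypothetical word -> Concept# mapping. Each list is ordered from the
# most frequently used sense to the least frequently used sense.
SEM_WORD_LIST: Dict[str, List[str]] = {
    "bank": ["Sem10", "Sem20"],  # financial institution, river bank
    "car": ["Sem1"],
    "dealer": ["Sem2"],
}


def concepts_for(word: str, all_senses: bool = False) -> List[str]:
    """Return the Concept# entries for a word.

    With all_senses=False only the most popular sense is returned.
    With all_senses=True every sense is returned, so each Concept#
    can be taken into account during query processing.
    Words that are not in the dictionary map to an empty list.
    """
    senses = SEM_WORD_LIST.get(word, [])
    return list(senses) if all_senses else senses[:1]
```

Calling concepts_for("bank") selects only the most popular sense, while
concepts_for("bank", all_senses=True) returns both concepts.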
Semantic groupings other than what has been presented
above may also be taken into consideration. In FIG. 4, only
query expansion by synonym is considered. Other forms of
semantic relaxation may also be considered, such as ISA,
IS_PART_OF, etc. Multiple tables of the form shown in
FIG. 4(a) can be created for various semantic groupings (for
example, one for ISA and one for IS_PART_OF).
Alternatively, a single table can be used for the various
semantic groupings. To perform query expansion by both
[...] hypernyms, lookups may be made to [...]

In order to deal with the problem of word mismatch, a
query processing scheme needs to expand the query words
with relevant words. As a result, an additional task for
ranking the documents by their relevance to the original
query words may be performed. Next, query processing with
expansion will be presented as three tasks in accordance
with the present invention: query expansion, query
processing, and result ranking.

First, query expansion will be discussed. FIG. 6 shows an
example of query expansion under a prior query expansion
scheme. A query "retrieve documents containing the words
car and dealer" is rewritten as shown by adding additional
words relevant to car and dealer. The relevant words of
semantic similarity and syntactic co-occurrence relationship
are determined from tables in FIG. 3. An example of a query
under the multi-granularity query expansion scheme of the
present invention is shown in FIG. 7. Multi-granularity
query expansion transforms the words car and dealer into the
concepts Sem1 and Sem2 by using the table in FIG. 3(a).
After translating the words into their corresponding higher
level semantic concepts, the table in FIG. 4(c) is used to
expand the semantic concepts to include syntactic
relationships; likewise, the proper names in the original query
are expanded to include relevant words from the co-occurrence
table.

Given a query Q involving both dictionary and
non-dictionary words, the query Q can be represented as:

(s_1 ∧ ... ∧ s_m) ∧ (p_1 ∧ ... ∧ p_n)

In this equation s_i represents a dictionary word and p_j
represents a non-dictionary word. Furthermore, there are m
dictionary words and n non-dictionary words in the query Q.
Given such a query, the multi-granularity query expansion is
performed as follows:

1. For each s_i, i = 1, ..., m, replace s_i in Q with its
corresponding higher level semantic concept from the
table in FIG. 3(a). Each such concept will be denoted
by C_i.

2. For each C_i, i = 1, ..., m, obtained from step 1, expand
Q by including syntactically related words to C_i by
using the table in FIG. 4(c). The entries of the form
S-S will contribute additional concepts and those of
the form S-P will contribute proper names.

3. For each p_j, j = 1, ..., n, expand Q by including
syntactically related words by co-occurrence to p_j by
using the table in FIG. 4(c). The entries of the form P-S
will contribute additional concepts and those of the
form P-P will contribute additional proper names.

4. Remove redundant query words or concepts in Q.

Compared with a query expanded with the traditional
scheme, an expanded query in accordance with the present
invention is more compact with fewer conditions to check
since the query words are converted to entities at a coarser
granularity. As a result, the query processing for the
expanded query in accordance with the present invention is
less expensive. Next, the number of entities (words or
concepts) introduced in the query expansion of prior
schemes as well as in the multi-granularity query expansion
scheme of the present invention will be estimated. Recall
that the average number of dictionary words grouped under
a higher level semantic concept is denoted by f. Let g be the
average number of semantically related higher level concepts
to a word and let h be the average number of
syntactically related names related to a word. Then
the number of words in Q under basic query expansion (BQ)
is the total expansion that occurs in steps 1, 2, and 3:

Count(BQ) = mf + m(g + h) + n(g + h)

where the first term arises since each of the m dictionary
words is replaced by f semantically similar words. The
second term arises since each of the m dictionary words has
additional g + h co-occurrences of dictionary and
non-dictionary types. The third term corresponds to each of n
proper names adding g + h co-occurrences. Similarly, the
number of words and concepts in Q under the
multi-granularity query expansion (MGQ) is as follows:

Count(MGQ) = m + m(g/f + h) + n(g/f + h)

Essentially, the main distinction is that the compression
factor f arises since the higher level semantic
representation of the set of similar words is being used.
Thus, the number of words/concepts involved in the query
under multi-granularity query expansion is strictly smaller
than that in basic query expansion. If the number of proper
names per word is small in the table of FIG. 4(c), then the
query complexity in the present expansion scheme reduces
by a factor of f.

Turning now to query processing, in traditional query
processing, based on exact matching, the search process can
be terminated as soon as it can be determined that the search
predicate associated with the query cannot be satisfied. This
is not the case in IR since the search is based on similarity.
In particular, a user may want to see results with partial
matches to the search. Therefore, [...] needed [...]
boolean conditions in the search predicate. Furthermore,
since partial match is supported, an additional ranking phase
is needed in query processing. The ranking scheme
[...] information on which [...]
and the word frequency in the documents.

Now the lookup costs associated with processing a query
under the two schemes will be analyzed. Note that the basic
processing cost will arise due to two factors:

The number of words in the basic query expansion is
larger than that in the multi-granularity query
expansion, and

The number of entries in the respective tables over which
the lookup will be performed is different under the two
schemes.

Now the lookup costs for the query Q considered above
will be estimated. Assuming that the tables are organized
in the form of balanced search structures, the
operation of table lookup will be logarithmic in the
number of rows in the table. Hence, by using the
estimates developed in Section 3, the lookup cost of
executing Q under basic query expansion is:
LookupCost(Q, BQ) = mf · log(Number Of Rows[2(b)]) +
    (m + n) · (g + h) · log(Number Of Rows[3(b)])
  = mf · log(W + V) +
    (m + n) · (g + h) · log(V(V - 1)/2 + VW + W(W - 1)/2)

Similarly, the lookup cost of executing Q under multi-
granularity expansion will be:

LookupCost(Q, MGQ) = m · log(Number Of Rows[4(b)]) +
    (m + n) · (g/f + h) · log(Number Of Rows[4(c)])
  = m · log(W/f + V) +
    (m + n) · (g/f + h) · log(Number Of Rows[4(c)])

Since the number of dictionary word lookups is reduced by
a factor of f in MGQ and both tables on which the lookup
is performed are smaller in MGQ, the query processing
cost under MGQ is lower than that in BQ.
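The comparison of the two lookup costs can be checked numerically with
a short sketch. The function names below are hypothetical, base-2
logarithms are an assumption (the text only states that a lookup is
logarithmic in the number of rows), and the row count used for the
table of FIG. 4(c) assumes the form V(V - 1)/2 + V(W/f) +
(W/f)(W/f - 1)/2. The parameter values in the usage note are invented
for illustration only.

```python
import math


def rows_3b(V: float, W: float) -> float:
    """Worst-case rows of the word-level co-occurrence table (FIG. 3(b))."""
    return V * (V - 1) / 2 + V * W + W * (W - 1) / 2


def rows_4c(V: float, W: float, f: float) -> float:
    """Assumed rows of the concept-level co-occurrence table (FIG. 4(c))."""
    C = W / f  # number of semantic concepts
    return V * (V - 1) / 2 + V * C + C * (C - 1) / 2


def lookup_cost_bq(m, n, f, g, h, V, W) -> float:
    """Lookup cost of Q under basic query expansion (BQ)."""
    return (m * f * math.log2(W + V)
            + (m + n) * (g + h) * math.log2(rows_3b(V, W)))


def lookup_cost_mgq(m, n, f, g, h, V, W) -> float:
    """Lookup cost of Q under multi-granularity query expansion (MGQ)."""
    return (m * math.log2(W / f + V)
            + (m + n) * (g / f + h) * math.log2(rows_4c(V, W, f)))
```

For example, with m = 2, n = 1, f = 5, g = 10, h = 3, V = 10000 and
W = 50000, lookup_cost_mgq returns a smaller value than lookup_cost_bq,
in line with the conclusion above.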

Turning now to the ranking scheme of the present
invention, in the query processing phase, word representations
at a coarser granularity are used for filtering out
unrelated documents. However, the candidate documents
have the same ranking since they all satisfy both conditions,
car and dealer at a coarser granularity level. This is not a
desired property for query processing results. Therefore, in
the ranking phase, the original words in the candidate
documents are accessed and are used for ranking.

In FIG. 8, four candidate documents are shown with
keywords satisfying the condition:

(Sem1 ∨ Ford ∨ Buick) ∧ (Sem2 ∨ Ford ∨ Buick)

Their initial matching keywords are retrieved for ranking.
Thus, (car, dealer), (auto, dealer), (auto, sales office), and
(Ford, showroom) are used to rank their degrees of
relevance.

The candidates are ranked based on the degrees of relaxation
in matching words in the document with words in the query.
In one example, the degrees of relaxation may be defined in
the order of E<Se<Sy<X, (i.e. Exact Match<Semantic
Relaxation<Syntactic Relaxation<Do Not Match), where a
word with a higher degree of query relaxation with respect
to the query word means that the query results with such a
word are less relevant to the user. However, the order and
definition of degrees of relaxation can be arbitrary as
applications require. The less relaxation that was used to find the
matching candidate, the higher the candidate document is
ranked. On the bottom of FIG. 8, the document with the
words car and dealer is given the highest rank since the
candidate words match the query words exactly. The
document with words "auto" and "dealer" has the second highest
rank since only one word requires semantic relaxation (i.e.
replacing query terms with semantically related terms) to be
matched with the query word car. Other ranking is carried
out similarly in FIG. 8.
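Ranking by degree of relaxation can be sketched as follows. This is an
illustrative sketch, not the claimed ranking scheme: the Relaxation
enumeration and the rank function are hypothetical names, and ordering
candidates by the sum of their per-word relaxation degrees is an
assumption made here to combine the degrees of several query words.
Only the (E, E) and (Se, E) assignments for the first two candidates
are taken from the description; the other two are invented examples.

```python
from enum import IntEnum
from typing import List, Tuple


class Relaxation(IntEnum):
    """Degrees of relaxation in the order E < Se < Sy < X."""
    EXACT = 0      # E: exact match
    SEMANTIC = 1   # Se: semantic relaxation
    SYNTACTIC = 2  # Sy: syntactic relaxation
    NO_MATCH = 3   # X: does not match


Candidate = Tuple[str, Tuple[Relaxation, ...]]


def rank(candidates: List[Candidate]) -> List[str]:
    """Return document ids, least relaxed first.

    Assumption: candidates are compared by the sum of the relaxation
    degrees of their matching words; ties keep their input order.
    """
    ordered = sorted(candidates, key=lambda c: sum(c[1]))
    return [doc_id for doc_id, _ in ordered]


candidates = [
    ("doc_c", (Relaxation.SEMANTIC, Relaxation.SEMANTIC)),   # hypothetical
    ("doc_a", (Relaxation.EXACT, Relaxation.EXACT)),         # car, dealer
    ("doc_d", (Relaxation.SYNTACTIC, Relaxation.SEMANTIC)),  # hypothetical
    ("doc_b", (Relaxation.SEMANTIC, Relaxation.EXACT)),      # auto, dealer
]
ranking = rank(candidates)
```

Under these assumptions the document matching car and dealer exactly is
ranked first and the document matching auto and dealer second.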

The ranking scheme is based on the following two
principles:

For a given query keyword, Q, if the relationships between
Q and the keywords Word1, Word2, Word3 and Word4
in Doc1, Doc2, Doc3 and Doc4 respectively are:
exact match, match through semantic query
relaxation, match through syntactic query relaxation
and no match, the documents are ranked as
follows: Doc1 > Doc2 > Doc3 > Doc4.

The ranking (scores) for M documents, Doc_i, i = 1 ... M, with
Match_i, i = 1 ... M, keywords matching the query
respectively are as follows: Doc_1 > Doc_2 > Doc_3 ... Doc_(M-1) >
Doc_M if Match_1 > Match_2 > Match_3 ... Match_(M-1) >
Match_M.

Based on the above ranking scheme, a two-dimensional
ranking graph can be generated for the documents for a
query with two keyword terms, as
shown in FIG. 9. Without query expansion, only documents
in the slot (E, E) are retrieved. With both semantic and
syntactic query expansion, all relevant documents are
retrieved unless the documents are in the slot (X, X).

The ranking graph is represented as a matrix. For a query
with N terms the ranking graph is represented by an N by 4
matrix: M(i, j), i = 0 ... N and j = 0 ... 3. For example, the
ranking graph in FIG. 9 is represented as a matrix, M(i, j),
i = 0 ... 2 and j = 0 ... 3. The slots (E, E), (Se, E), (Se, Sy),
and (X, X), for example, are represented as slots (3,3), (2,3),
(2,1), and (0,0), respectively, in the matrix. With this
representation, the documents can be easily ranked as
follows:

For the documents in the slot (n, m), where m is between
0 and 3, the ranking scores of these documents are
higher than those of the documents in slots (i, j), where
i = 0 ... n and j = 0 ... 3.

The ranking score for the documents in the slot (n1, m1)
is higher than or equal to that of the documents in the slot
(n2, m2) if n1 ≥ n2 and m1 ≥ m2.
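The mapping from degrees of relaxation to matrix slots, and the second
ranking rule, can be illustrated with a small sketch. The names SCORE,
slot and ranks_at_least are hypothetical; the numeric values follow the
slot examples given above, in which (E, E) maps to (3,3) and (X, X)
maps to (0,0).

```python
from typing import Dict, Sequence, Tuple

# Score per degree of relaxation, matching the slot examples above:
# E -> 3, Se -> 2, Sy -> 1, X -> 0.
SCORE: Dict[str, int] = {"E": 3, "Se": 2, "Sy": 1, "X": 0}


def slot(degrees: Sequence[str]) -> Tuple[int, ...]:
    """Map per-keyword relaxation degrees to a slot of the ranking graph."""
    return tuple(SCORE[d] for d in degrees)


def ranks_at_least(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if slot a ranks higher than or equal to slot b under the
    component-wise rule (n1 >= n2 and m1 >= m2)."""
    return all(x >= y for x, y in zip(a, b))
```

Note that this rule alone leaves some pairs of slots unordered: neither
(2,3) nor (3,2) dominates the other.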
The presentation of the ranking graph is carried out by
available visualization tools. For example, a visualization
method, called Cone Trees, can be modified by adding the
depth for 3 dimensional ranking graph presentation. See G.
G. Robertson et al., "Information Visualization Using 3D
Interactive Animation," Communications of the ACM, Vol.
36, No. 4, pages 57-71, April, 1993.

Based on this ranking scheme, the results in the slot at the 
top of FIG. 9 have higher ranking scores than the results at 
the bottom. However, it is difficult to rank the results in slots 
at the same class in FIG. 9. FIG. 10 shows how such a 
ranking may be accomplished. The result slots are further 
classified into classes, where results in slots in the same class 
have the same ranking. 

With the class structure of FIG. 10, query processing, in 
accordance with the present invention, can be performed 
progressively by class. Take the example where a user issues 
a query with two keywords and requests that the top 50 
results be retrieved. Referring to FIG. 10, the query proces- 
sor can initially produce the results in class 0. If the number 
of results is greater than 50, the query processor can stop 
without performing the query expansion task. If the number 
of results in class 0 is less than 50, the query processor can 
then produce the results in class 1 (e.g. slots (2,3) and (3,2)). 
If the total number of results (e.g. in class 0 and class 1) is 
greater than 50, the query processor can stop without per- 
forming further query processing. Note that the query pro- 
cessor can produce the results in slots (2,3) and (3,2) 
successively as well. That is to say, the query processor can 
produce the results in slot (2,3) first. If the total number
of results is greater than 50, the query processor can stop 
without producing the results in slot (3,2). The query pro- 
cessor can continue producing additional results from the 
remaining slots and classes as described, until the total 
number of results is greater than 50, or until the last class is 
reached. 
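Progressive query processing by class can be sketched as below. This
is a minimal illustration rather than the query processor itself: the
callable fetch_slot, which returns the documents of one slot, is
hypothetical, as are the function names. The class numbering assumes
that the class of a slot (i, j) in a two-keyword ranking graph is
6 - (i + j), which reproduces class 0 = (3,3) and class 1 = (2,3),
(3,2) from the example above; the remaining class assignments of
FIG. 10 are not confirmed by the text. The sketch stops as soon as the
requested number of results has been reached.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Slot = Tuple[int, int]  # (score for keyword 1, score for keyword 2)


def slots_by_class() -> Dict[int, List[Slot]]:
    """Group the slots of a two-keyword ranking graph into classes.

    Assumption: class number = 6 - (i + j); slot (0, 0), i.e. (X, X),
    is skipped because such documents are never retrieved.
    """
    classes: Dict[int, List[Slot]] = {}
    for s in product(range(4), repeat=2):
        if s == (0, 0):
            continue
        classes.setdefault(6 - sum(s), []).append(s)
    return classes


def progressive_query(fetch_slot: Callable[[Slot], List[str]],
                      k: int) -> List[str]:
    """Produce results class by class and slot by slot, stopping once
    k results are available or the last class has been processed."""
    results: List[str] = []
    classes = slots_by_class()
    for cls in sorted(classes):
        for s in sorted(classes[cls]):
            results.extend(fetch_slot(s))
            if len(results) >= k:
                return results[:k]
    return results
```

With this structure, slots in later classes are never fetched when the
earlier classes already yield enough results, so the cost of further
expansion is avoided.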
If the above example is modified so that the user specifies
that one keyword is more important than the other, the order
in which the query processor retrieves slots of results can be
modified accordingly. For example, if the user specifies that
keyword 1 is more important than keyword 2, a horizontal
query processing order within classes can be derived, as
shown in FIG. 11. Then, in the example, the query processor
would produce the results in slot (3,2) first. Then, if the total
number of results was less than 50, the query processor
would subsequently produce the results in slot (2,3).
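The effect of keyword importance on the order of slots within a class
can be sketched in a few lines. The function order_within_class is a
hypothetical helper; it assumes that slots are tuples of per-keyword
scores (3 for an exact match down to 0 for no match) and that the slot
with the least relaxation on the more important keyword is processed
first.

```python
from typing import List, Tuple

Slot = Tuple[int, ...]


def order_within_class(slots: List[Slot], important: int) -> List[Slot]:
    """Order the slots of one class so that slots with the highest score
    on the more important keyword (index `important`) come first."""
    return sorted(slots, key=lambda s: s[important], reverse=True)
```

For class 1 of the example, order_within_class([(2, 3), (3, 2)], 0)
places slot (3,2) before slot (2,3) when keyword 1 is the more
important keyword.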

FIG. 12 shows a physical implementation of a system
upon which the present invention may be practiced. Such a
system includes a database 1206 for storing a collection of
documents. The database may include an index 1208 for
storing concepts (e.g. semantic or syntactic concepts)
and their relationships to the documents in the collection.
The system may further include an indexer 1210 for creating
the index 1208 and for also creating an index 1208 containing
higher level granularity concepts and their relationships
to the documents in the collection. A processor 1204 may be
used to accept queries presented by a user through the user
interface 1202. The processor 1204 then processes the query
and performs the ranking function. The results of the query and
ranking function are presented back to the user through the
user interface 1202.

Those skilled in the art will recognize that the exemplary
environment illustrated in FIG. 12 is not intended to limit the
present invention. Indeed, those skilled in the art will
recognize that other alternative hardware environments may
be used without departing from the scope of the invention.
For example, the various functions described above may be
performed by separate elements (e.g. the query processing
and ranking functions may be performed by different
components) or may be performed by a single element (e.g.
a single processor may perform the indexing, query
processing and ranking functions).

In summary, the present invention presents a novel
technique for supporting query expansion efficiently using
multi-granularity indexing (saving index space) and query
processing (saving processing time) schemes while the
original effectiveness (i.e. precision and recall) of a given
input set of document keywords, lexical semantics
dictionary, and queries is preserved.

The multi-granularity indexing and query processing
scheme in accordance with the present invention allows for
smaller size indices for word associations, faster query
processing time since queries are simplified, and consistent
ranking results since the ranking technique in accordance
with the present invention is based on initial words in
documents.

Other modifications and variations to the invention will be
apparent to those skilled in the art from the foregoing
disclosure and teachings. Thus, while only certain
embodiments of the invention have been specifically described
herein, it will be apparent that numerous modifications may
be made thereto without departing from the spirit and scope
of the invention.

What is claimed is: 

1. A method of querying a database of documents, the 
database including a preliminary index of the documents, 
words contained in the documents and associations
therebetween, the words in the preliminary index being of an 
original granularity, the method comprising the steps of: 

a) replacing the words in the preliminary index with 
corresponding higher granularity concepts, resulting in
a coarser granularity index of reduced index size;

b) logically expanding a query applied to the database of 
documents by replacing only the words of the query, 
being of the original granularity, meeting a predetermined
criterion, which is whether the words can be
found in a lexical dictionary with corresponding ones 
of the higher granularity concepts, 
b)(i) wherein the higher granularity concepts are higher 

granularity semantic concepts, 
b)(ii) further logically expanding the query by adding 
syntactically related words for each of the corre- 
sponding ones of the higher granularity concepts; 
b)(iii) further logically expanding the query by adding 
syntactically related words for each of the words in 
the query failing to meet the predetermined criterion; 
b)(iv) replacing ones of the syntactically related words 
meeting the predetermined criterion with associated 
ones of the higher granularity concepts; and 
b)(v) removing any redundant ones of the syntactically 
related words and higher granularity concepts from 
the expanded query; 

c) executing the logically expanded query to retrieve ones 
of the documents associated, through the coarser granu- 
larity index, with the corresponding ones of the higher 
granularity concepts; and 

d) retrieving ones of the documents in order of relevance 
until a predetermined number of ones of the documents 
associated with the corresponding ones of the higher 
granularity concepts are retrieved, wherein the order of 
relevance is an exact match, a semantic match, a 
syntactical match and no match between the words of 
the query and the words contained in the retrieved ones 
of the documents. 

2. The method according to claim 1, wherein in the 
replacing step, the higher granularity concepts are higher 
granularity semantic concepts. 

3. The method according to claim 2, wherein the higher 
granularity semantic concepts each contain synonyms. 

4. The method according to claim 1, wherein in the 
replacing step, only ones of the words in the preliminary 
index meeting a predetermined criterion are replaced by the 
corresponding higher granularity concepts. 

5. The method according to claim 4, wherein the prede- 
termined criterion is whether the words can be found in a 
lexical dictionary. 

6. The method according to claim 1, wherein in the 
replacing step, the higher granularity concepts are higher 
granularity syntactical concepts. 

7. The method according to claim 6, wherein the higher 
granularity syntactical concepts each contain words 
co-occurring in ones of the documents above a threshold 
level of frequency. 

8. The method according to claim 1, wherein in the 
replacing step, ones of the words in the preliminary index 
having multiple meanings are replaced by multiple ones of 
the corresponding higher granularity concepts. 

9. The method according to claim 1, wherein words 
failing to meet the predetermined criterion are proper nouns. 

10. The method according to claim 1, wherein the step of 
executing progresses in successive stages until a predeter- 
mined number of the ones of the documents associated with 
the corresponding ones of the higher granularity concepts 
are retrieved. 

11. The method according to claim 10, wherein each of 
the stages represents a class of expansion. 

12. The method according to claim 10, wherein each of 
the stages represents a slot within a class of expansion. 

13. The method according to claim 10, wherein within 
each of the stages, the ones of the documents are retrieved 
in an order reflecting a level of importance assigned to at 
least one of the words of the query. 
14. A method of querying a database of documents, the
database including an index of reduced index size of the
documents, higher granularity concepts and associations 
therebetween, the higher granularity concepts corresponding 
to words of original granularity contained in the documents, 
the method comprising the steps of: 

a) logically expanding a query applied to the database of 
documents by replacing only words of the query, being 
of the original granularity, meeting a predetermined 
criterion, which is whether the words can be found in 
a lexical dictionary, with corresponding ones of the 
higher granularity concepts, 

a)(i) wherein the higher granularity concepts are higher 
granularity semantic concepts, 

a)(ii) further logically expanding the query by adding 
syntactically related words for each of the corre- 
sponding ones of the higher granularity concepts; 

a)(iii) further logically expanding the query by adding 
syntactically related words for each of the words in 
the query failing to meet the predetermined criterion; 

a)(iv) replacing ones of the syntactically related words
meeting the predetermined criterion with associated
ones of the higher granularity concepts; and

a)(v) removing any redundant ones of the syntactically 
related words and higher granularity concepts from 
the expanded query; 

b) executing the logically expanded query to retrieve 
documents associated, through the index, with the 
corresponding ones of the higher granularity concepts; 
and 

c) retrieving ones of the documents associated with the 
corresponding ones of the higher granularity concepts 
are retrieved, wherein the retrieved ones of the docu- 
ments are ranked using the words of the query, being of 
the original granularity.

15. The method according to claim 14, wherein the higher 
granularity semantic concepts each contain synonyms. 

16. The method according to claim 14, wherein the higher 
granularity concepts are higher granularity syntactical con- 
cepts. 

17. The method according to claim 16, wherein the higher 
granularity syntactical concepts each contain words 
co-occurring in ones of the documents above a threshold 
level of frequency. 

18. The method according to claim 14, wherein the words 
failing to meet the predetermined criterion are proper nouns. 

19. The method according to claim 14, where the syntac- 
tically related words are words co-occurring in one of the 
documents above a threshold level of frequency. 

20. The method according to claim 14, wherein the order 
of relevance is an exact match, a semantic match, a syntac- 
tical match and no match between the words of the query 
and words in the retrieved ones of the documents. 

21. The method according to claim 14, wherein ones of 
the words of original granularity contained in the documents 
correspond to multiple ones of the higher granularity con- 
cepts. 

22. The method according to claim 14, wherein the step 
of executing progresses in successive stages until a prede- 
termined number of the ones of the documents associated 
with the corresponding ones of the higher granularity con- 
cepts are retrieved. 

23. The method according to claim 22, wherein each of 
the stages represents a class of expansion. 

24. The method according to claim 22, wherein each of 
the stages represents a slot within a class of expansion.

25. The method according to claim 22, wherein within 
each of the stages, the ones of the documents are retrieved 
in an order reflecting a level of importance assigned to at
least one of the words of the query. 

26. A system for querying a database of documents, the 
database including a preliminary index of the documents, 
words contained in the documents and associations 
therebetween, the words in the preliminary index being of an 
original granularity, the system comprising: 

a) an indexer for replacing the words in the preliminary 
index with corresponding higher granularity concepts, 
resulting in a coarser granularity index of reduced 
index size; 

b) a user interface for providing a query to be applied to 
the database of documents; and 

c) a processor for logically expanding the query by
replacing only the words of the query, being of the
original granularity, meeting a predetermined criterion,
which is whether the words can be found in a lexical
dictionary, with corresponding ones of the higher
granularity concepts, whereupon the processor 
executes the logically expanded query to retrieve ones 
of the documents associated, through the coarser granu- 
larity index, with the corresponding ones of the higher 
granularity concepts, wherein the processor retrieves 
ones of the documents in order of relevance until a 
predetermined number of ones of the documents asso- 
ciated with the corresponding ones of the higher granu- 
larity concepts are retrieved, using the words of the 
query, being of the original granularity, and wherein the 
order of relevance is an exact match, a semantic match, 
a syntactical match and no match between the words of 
the query and the words contained in the retrieved ones 
of the documents, 

c)(i) wherein the higher granularity concepts are higher 
granularity semantic concepts, and wherein logically 
expanding the query further comprises; 

c)(ii) adding syntactically related words for each of the 
corresponding ones of the higher granularity con- 
cepts; 

c)(iii) adding syntactically related words for each of the 
words in the query failing to meet the predetermined 
criterion; 

c)(iv) replacing ones of the syntactically related words 
meeting the predetermined criterion with associated 
ones of the higher granularity concepts; and 

c)(v) removing any redundant ones of the syntactically 
related words and higher granularity concepts from 
the expanded query. 

27. The system according to claim 26, wherein the higher 
granularity concepts are higher granularity semantic con- 
cepts. 

28. The system according to claim 27, wherein the higher 
granularity semantic concepts each contain synonyms. 

29. The system according to claim 26, wherein the
indexer replaces only ones of the words in the preliminary 
index meeting a predetermined criterion by the correspond- 
ing higher granularity concepts. 

30. The system according to claim 29, wherein the pre- 
determined criterion is whether the words can be found in a 
lexical dictionary. 

31. The system according to claim 26, wherein the higher 
granularity concepts are higher granularity syntactical con- 
cepts. 

32. The system according to claim 31, wherein the higher 
granularity syntactical concepts each contain words 
co-occurring in ones of the documents above a threshold 
level of frequency. 

33. The system according to claim 26, wherein ones of the 
words in the preliminary index having multiple meanings 
are replaced by multiple ones of the corresponding higher
granularity concepts. 

34. The system according to claim 26, wherein words 
failing to meet the predetermined criterion are proper nouns.

35. The system according to claim 26, wherein the execu- 
tion of the query progresses in successive stages until a 
predetermined number of the ones of the documents asso- 
ciated with the corresponding ones of the higher granularity 
concepts are retrieved. 

36. The method according to claim 35, wherein each of 
the stages represents a class of expansion. 

37. The method according to claim 35, wherein each of 
the stages represents a slot within a class of expansion. 

38. The system according to claim 35, wherein within 
each of the stages, the ones of the documents are retrieved 
in an order reflecting a level of importance assigned to at 
least one of the words of the query. 

39. A system of querying a database of documents, the 
database including an index of reduced index size of the 
documents, higher granularity concepts and associations 
therebetween, the higher granularity concepts corresponding
to words of original granularity contained in the
documents, the system comprising: 

a) a user interface for providing a query to be applied to 
the database of documents; and 

b) a processor for logically expanding the query replacing 
only words of the query meeting a predetermined 
criterion, which is whether the words can be found in 
a lexical dictionary, being of the original granularity, 
with corresponding ones of the higher granularity 
concepts, whereupon the processor executes the logi- 
cally expanded query to retrieve documents associated, 
throughout the index, with the corresponding ones of 
the higher granularity concepts, 

b)(i) wherein the higher granularity concepts are higher 
granularity semantic concepts, and wherein the pro- 
cessor logically expands the query by further; 

b)(ii) adding syntactically related words for each of the 
corresponding ones of the higher granularity con- 
cepts; 

b)(iii) adding syntactically related words for each of the 
words in the query failing to meet the predetermined 
criterion; 



b)(iv) replacing ones of the syntactically related words
meeting the predetermined criterion with associated
ones of the higher granularity concepts; and

b)(v) removing any redundant ones of the syntactically
related words and higher granularity concepts from
the expanded query, wherein the processor further
retrieves ones of the documents in order of relevance
until a predetermined number of ones of the documents
associated with the corresponding ones of the
higher granularity concepts are retrieved.

40. The system according to claim 39, wherein the higher
granularity semantic concepts each contain synonyms.
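
As a rough illustration only, the expansion steps b) through b)(v) recited in claim 39 might be sketched as follows. The dictionary contents, the co-occurrence table, and every function and variable name here are hypothetical and do not come from the patent; concepts are represented as plain string labels.

```python
# Hypothetical data: word -> semantic concept label (stands in for a lexical dictionary).
LEXICON = {
    "car": "C_AUTO",
    "automobile": "C_AUTO",
    "dealer": "C_SELLER",
    "showroom": "C_SHOWROOM",
}

# Hypothetical data: term (word or concept) -> words co-occurring above a threshold.
CO_OCCUR = {
    "C_AUTO": ["showroom", "ford"],
    "C_SELLER": ["showroom"],
    "ford": ["car"],
}


def to_concept(word):
    """Replace a dictionary word with its concept; leave other words unchanged."""
    return LEXICON.get(word, word)


def expand_query(words):
    """Return the expanded query as an ordered list without duplicates."""
    # b) replace only words found in the lexical dictionary with concepts
    base = [to_concept(w) for w in words]
    # b)(ii), b)(iii) add syntactically related words, both for concepts and
    # for words that are not in the dictionary (e.g. proper nouns)
    related = [r for term in base for r in CO_OCCUR.get(term, [])]
    # b)(iv) replace related words found in the dictionary with their concepts
    related = [to_concept(r) for r in related]
    # b)(v) remove redundant entries while keeping first-seen order
    return list(dict.fromkeys(base + related))
```

With this toy data, the query "car dealer" expands to the terms C_AUTO, C_SELLER, C_SHOWROOM and ford; the non-dictionary word "ford" is kept at its original granularity.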

41. The system according to claim 39, wherein the higher
granularity concepts are higher granularity syntactical concepts.

42. The system according to claim 41, wherein the higher
granularity syntactical concepts each contain words
co-occurring in ones of the documents above a threshold
level of frequency.

43. The system according to claim 39, wherein the words 
failing to meet the predetermined criterion are proper nouns. 

44. The system according to claim 39, where the syntactically
related words are words co-occurring in one of the
documents above a threshold level of frequency.

45. The system according to claim 39, wherein the order
of relevance is an exact match, a semantic match, a syntactical
match and no match between the words of the query
and words in the retrieved ones of the documents.
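
The order of relevance recited in claim 45 can be pictured with a small sketch. This is an assumption-laden illustration, not the patented implementation: the `Match` enum, the `classify` and `rank` helpers, and the synonym and co-occurrence tables passed to them are all hypothetical names introduced here, and only a single query word is considered.

```python
from enum import IntEnum


class Match(IntEnum):
    """Match classes, from most to least relevant."""
    EXACT = 0
    SEMANTIC = 1
    SYNTACTIC = 2
    NONE = 3


def classify(query_word, doc_words, synonyms, co_occurring):
    """Return the best match class between one query word and a document."""
    if query_word in doc_words:
        return Match.EXACT
    if doc_words & synonyms.get(query_word, set()):
        return Match.SEMANTIC
    if doc_words & co_occurring.get(query_word, set()):
        return Match.SYNTACTIC
    return Match.NONE


def rank(query_word, docs, synonyms, co_occurring):
    """Sort document ids by match class; ties keep their input order."""
    return sorted(
        docs,
        key=lambda d: classify(query_word, docs[d], synonyms, co_occurring),
    )
```

Here `docs` maps a document id to the set of words it contains. Because Python's sort is stable, documents in the same match class stay in their original order.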

46. The system according to claim 39, wherein ones of the 
words of original granularity contained in the documents 
correspond to multiple ones of the higher granularity con- 
cepts. 

47. The system according to claim 39, wherein the execution
of the query progresses in successive stages until a
predetermined number of the ones of the documents associated
with the corresponding ones of the higher granularity
concepts are retrieved.

48. The method according to claim 47, wherein each of
the stages represents a class of expansion.

49. The method according to claim 47, wherein each of 
the stages represents a slot within a class of expansion. 

50. The system according to claim 47, wherein within
each of the stages, the ones of the documents are retrieved
in an order reflecting a level of importance assigned to at
least one of the words of the query.
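
The staged execution of claims 47 to 50 might look roughly like the sketch below. It assumes, purely for illustration, that each stage is a list of terms already ordered by importance and that the index is a plain dictionary from a term to document ids; `progressive_retrieve`, `stages`, `index` and `limit` are hypothetical names, not terms from the patent.

```python
def progressive_retrieve(stages, index, limit):
    """Run query stages in order and stop once `limit` documents are found.

    stages: list of lists of terms, one list per stage (for example exact
            terms first, then semantic expansions, then syntactic ones).
            Terms inside a stage are assumed to be sorted by importance.
    index:  dict mapping a term to a list of document ids.
    limit:  the predetermined number of documents to retrieve.
    """
    results = []
    if limit <= 0:
        return results
    seen = set()
    for terms in stages:
        for term in terms:
            for doc in index.get(term, []):
                if doc in seen:
                    continue  # already retrieved in an earlier stage or term
                seen.add(doc)
                results.append(doc)
                if len(results) >= limit:
                    return results  # later stages are never executed
    return results
```

The early return is the point of the design: when the first stages already yield enough documents, the more expensive expanded stages are skipped entirely.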

*****
UNITED STATES PATENT AND TRADEMARK OFFICE

CERTIFICATE OF CORRECTION 



PATENT NO. : 6,480,843 B2 Page 1 of 1 

DATED : November 12, 2002

INVENTOR(S) : Wen-Syan Li



It is certified that error appears in the above-identified patent and that said Letters Patent is 
hereby corrected as shown below: 



Column 6, 

Line 22, delete "TR" and insert -- IR --;

Line 42, after "dealer" insert -- , --; after "showroom" ", insert -- , --;

Line 43, after " room" ", insert -- . --;

Line 45, after " "automobile" ", insert -- ) --;

Line 66, after " mobile" ", insert -- , --.

Column 8, 

Line 13, delete "wi", insert ~ Wj - (both occurrences); 

Line 14, delete "Sen^", insert -- Semi and delete "wj", insert -- w { 

Line 15, delete "Semi", insert -- Sem^ --; 

Line 33, delete "wi", insert - wi --. 



Column 9, 

Line 2, after "non-dictionary", delete "w,'\

Column 14,

Line 37, after "Animation" ", insert -- , --.



Column 17, 

Line 30, after "documents", insert -- in order of relevance until a
predetermined number of ones of the documents --.

Line 34, after "granularity", delete "," and insert -- . --.



Signed and Sealed this 
Fourth Day of March, 2003 




JAMES E. ROGAN 
Director of the United States Patent and Trademark Office